Optimal giving up result pilot summary
Data
Here we compare different pilot versions of the optimal stopping experiment. We are trying to replicate the result from Costermans et al. (1992) in which reaction time on omission trials (trials where no response is given) increases with feeling-of-knowing (FOK) reports. We want to show this effect in our paradigm, replacing the FOK report with a memory-strength measure based on earlier recall performance.
We’ve run many versions of the experiment, with mixed results. This report summarizes all of the versions. The critical dimension appears to be the instructions, given in full below. To summarize, the “leading instructions” encourage participants to skip trials in two ways:
1. We instruct: “So if you don’t think you know the word, it might be best to quickly skip the trial to get the time bonus and avoid the error penalty.”
2. We include a quiz question, “If you don’t know the word you should…”, with the correct answer “guess a random word”.
We’ve also run a pilot with a higher bonus rate (5¢ vs. 3¢ per correct response) and one with no time incentive on the critical trials, but these changes don’t seem to make as much of a difference.
Response types
Below is the distribution of response types on the critical trials, in aggregate and by participant, before any exclusions.
response_colors = scale_colour_manual(values=list(
other="deeppink2",
intrusion="deeppink4",
empty="gray",
correct="springgreen4"
), aesthetics=c("fill", "colour"), name="")
all_trials %>%
count(pilot, response_type) %>%
pivot_wider(names_from=response_type, values_from=n) %>%
replace(is.na(.), 0) %>%
pivot_longer(!pilot, names_to="response_type", values_to="n") %>%
group_by(pilot) %>% mutate(prop = n/sum(n)) %>% ungroup() %>%
ggplot(aes(pilot, prop, fill=fct_rev(response_type))) +
geom_bar(stat="identity") +
response_colors +
labs(x="", y="Proportion of Trials") + coord_flip()
all_trials %>%
count(pilot, wid, response_type) %>%
pivot_wider(names_from=response_type, values_from=n) %>%
replace(is.na(.), 0) %>%
mutate(wid=fct_reorder(wid, 100*correct + empty)) %>%
pivot_longer(!c(pilot, wid), names_to="response_type", values_to="n", names_prefix="") %>%
group_by(pilot, wid) %>% mutate(prop = n/sum(n)) %>% ungroup() %>%
ggplot(aes(wid, prop, fill=fct_rev(response_type))) +
geom_bar(stat="identity") +
response_colors +
labs(x="Participant", y="Proportion of Trials") +
scale_x_discrete(breaks=NULL) +
facet_wrap(~pilot, scales="free_x")

We see that:
- there are few errors (incorrect responses) in all pilots
- empty responses are common in all pilots
- there are more correct responses when the critical trials were more heavily incentivized (5¢ vs 3¢)
- there are more empty responses with leading instructions
Exclusions
Participants who always or never skip (give an empty response) don’t provide useful data. We therefore exclude participants who don’t have at least 4 skip trials and at least 4 non-skip trials out of the 37 critical trials. Note that we end up excluding more participants with the leading instructions, because more of those participants skip almost every trial.
all_trials$name = 'Human'
excl = all_trials %>%
group_by(pilot,wid) %>%
summarise(n_skip = sum(skip)) %>%
mutate(keep=between(n_skip, 4, 33))
excl %>%
mutate(excluded = if_else(keep, "included", "excluded")) %>%
ggplot(aes(pilot, fill=excluded)) +
geom_bar() +
ylab("Number of Participants") +
scale_colour_manual(values=c(
"gray", "dodgerblue"
), aesthetics=c("fill", "colour"), name="") + coord_flip() + xlab("")

keep = excl %>% filter(keep) %>% with(wid)
trials = all_trials %>%
filter(wid %in% keep) %>%
group_by(wid) %>%
mutate(rt_z = zscore(rt)) %>%
ungroup()
pretest = pretest %>% filter(wid %in% keep)
pretest %>%
filter(block == max(block)) %>%
rename(pre_correct = correct) %>%
mutate(pre_logrt = if_else(pre_correct, log(rt), 0)) %>%
group_by(wid, word) %>%
summarise(across(c(pre_correct, pre_logrt), mean)) %>%
group_by(wid) %>%
mutate(across(c(pre_correct, pre_logrt), zscore, .names="{.col}_z")) %>%
mutate(
raw_strength = -((1-pre_correct) * log(15000) + pre_logrt),
strength = zscore(raw_strength)
) %>%
right_join(trials) -> trials

Subjective judgements
The most direct replication of Costermans et al.: How does reaction time depend on explicit reports of confidence and feeling of knowing, given after a response is made?
Confidence for correct responses
How confident are you in your response?
Press a number between 1 and 5.
1 I am not at all sure my response is correct
2 I am not so sure my response is correct
3 I am more or less sure my response is correct
4 I am nearly sure my response is correct
5 I am absolutely sure my response is correct
trials %>% #plot
filter(response_type == "correct") %>%
regress(judgement, rt) +
xlab("Confidence Judgement")

| pilot | estimate | std.error | p.value |
|---|---|---|---|
| high bonus | -142.4327 | 70.6379 | 0.0784 |
| leading instructions | -90.3989 | 35.6747 | 0.0133 |
| no time bonus | -492.4245 | 189.7270 | 0.0318 |
| standard | -216.7514 | 89.5107 | 0.0506 |
All in the right direction. The effect is much larger when speed is not incentivized.
FOK for empty responses
How much do you feel that you know the word?
Press a number between 1 and 5.
1 I am absolutely sure I do not know the word
2 I am rather sure I do not know the word
3 I have a vague impression I know the word
4 I am rather sure I know the word
5 I am absolutely sure I know the word
trials %>% #plot
filter(skip) %>%
regress(judgement, rt, bins=0, bin_range=1) +
stat_summary(fun.data=mean_cl_boot, size=.2) +
xlab("Feeling of Knowing Judgement")

| pilot | estimate | std.error | p.value |
|---|---|---|---|
| high bonus | 314.4393 | 110.9809 | 0.0345 |
| leading instructions | 193.8678 | 60.0537 | 0.0134 |
| no time bonus | 1125.4998 | 318.6457 | 0.0053 |
| standard | 270.2153 | 169.8991 | 0.2695 |
All in the right direction!
Conclusion: We consistently replicate the Costermans finding with explicit judgements. Results are strongest when speed is not incentivized.
Objective memory strength measure
We can now ask the same thing, using performance on the pretest as an objective measure of the strength of each memory. The strength of each pair is defined as the negative of the average log reaction time over the two pretest exposures, where an inaccurate response counts as the maximum reaction time (15000 ms).
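To make the definition concrete, here is a minimal sketch in Python (not the analysis code; the 15000 ms maximum is taken from the R pipeline above, and the example reaction times are hypothetical):

```python
import math

MAX_RT_MS = 15000  # an incorrect response counts as the maximum RT

def raw_strength(pretest):
    """Negative mean log RT over the pretest exposures, with incorrect
    responses counted as the maximum possible reaction time."""
    log_rts = [math.log(rt) if correct else math.log(MAX_RT_MS)
               for correct, rt in pretest]
    return -sum(log_rts) / len(log_rts)

# Hypothetical (correct, rt_ms) pairs for three cues:
fast   = raw_strength([(True, 800), (True, 1000)])       # recalled quickly twice
mixed  = raw_strength([(True, 800), (False, 15000)])     # recalled once
missed = raw_strength([(False, 15000), (False, 15000)])  # never recalled

# Strength orders the cues as expected, and two misses give the
# minimum possible value, -log(15000):
assert fast > mixed > missed
assert math.isclose(missed, -math.log(MAX_RT_MS))
```

In the actual pipeline these raw values are then z-scored within participant to give the `strength` variable used below.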
| pilot | estimate | std.error | p.value |
|---|---|---|---|
| high bonus | -160.2295 | 42.8961 | 0.0003 |
| leading instructions | -248.5228 | 58.3524 | 0.0002 |
| no time bonus | -427.9684 | 125.2094 | 0.0038 |
| standard | -276.4942 | 107.0522 | 0.0205 |
We consistently see faster responses for higher-strength cues. This is unsurprising.
Reaction time on empty trials
Here’s the critical effect.
trials %>% #plot
filter(skip) %>%
regress(strength, rt) +
coord_cartesian(ylim=c(0, 4000))

| pilot | estimate | std.error | p.value |
|---|---|---|---|
| high bonus | -126.1001 | 146.0841 | 0.4337 |
| leading instructions | 112.0613 | 40.2186 | 0.0181 |
| no time bonus | -57.8315 | 216.7666 | 0.8229 |
| standard | 116.5285 | 91.0194 | 0.2251 |
The effect only comes out with leading instructions. In two pilots, the effect was actually negative (although given the size of the standard errors, we can’t take this too seriously).
The effect is a bit more robust when we z-score strength and reaction time within participant (using only that participant’s skip trials), but we still only see it with leading instructions.
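The normalization step can be sketched as follows (plain Python with made-up reaction times, just to show what z-scoring within one participant’s skip trials means; sample SD with an n-1 denominator, matching R’s sd()):

```python
from statistics import mean, stdev

def zscore(xs):
    """Standardize to mean 0 and SD 1 (sample SD, n-1 denominator)."""
    m, s = mean(xs), stdev(xs)
    return [(x - m) / s for x in xs]

# Hypothetical skip-trial RTs (ms) for one participant:
rts = [900.0, 1200.0, 1500.0]
rt_z = zscore(rts)

assert abs(mean(rt_z)) < 1e-9       # centered at zero
assert abs(stdev(rt_z) - 1) < 1e-9  # unit variance
```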
trials %>% #plot
filter(skip) %>%
group_by(wid) %>%
filter(sd(strength) != 0) %>%
mutate(strength=zscore(strength), rt_z = zscore(rt)) %>%
regress(strength, rt_z) +
coord_cartesian(ylim=c(-1.5, 1.5))

| pilot | estimate | std.error | p.value |
|---|---|---|---|
| high bonus | 0.0820 | 0.1320 | 0.5960 |
| leading instructions | 0.1864 | 0.0513 | 0.0015 |
| no time bonus | -0.0443 | 0.0818 | 0.5887 |
| standard | 0.0433 | 0.1039 | 0.6898 |
This is not just some quirk of the sample. I actually ran three identical versions of the “leading instructions” experiment, and all three of them show a strong effect, with no other sample showing anything close to the same effect size (with z-scoring):
trials %>%
filter(skip) %>%
group_by(wid) %>%
filter(sd(strength) != 0) %>%
mutate(strength=zscore(strength), rt_z = zscore(rt)) %>%
group_by(version) %>%
group_modify(function(data, grp) {
lmer(rt_z ~ strength + (strength|wid), data=data) %>% tidy
}) %>%
filter(term == "strength") %>%
select(version, estimate, std.error, p.value) %>%
left_join(
load_data('participants') %>% count(version)
) %>% kable(digits=4)

| version | estimate | std.error | p.value | n |
|---|---|---|---|---|
| v6.5 | 0.2062 | 0.1261 | 0.1502 | 10 |
| v6.5B | 0.2264 | 0.0933 | 0.0169 | 10 |
| v6.5C | 0.1735 | 0.0703 | 0.0311 | 20 |
| v6.6 | 0.0841 | 0.1354 | 0.5960 | 10 |
| v6.7 | 0.0442 | 0.1061 | 0.6898 | 20 |
| v6.8 | -0.0454 | 0.0838 | 0.5887 | 20 |
What’s going on here?
Why do we only get the strength effect with the leading instructions?
Individual regressions
Our first clue comes from looking at the effect separately for each participant:
trials %>%
group_by(wid) %>%
filter(skip) %>%
ggplot(aes(strength, rt, group=wid)) +
geom_smooth(method="lm", level=0, size=.5, color="black") +
facet_wrap(~pilot) +
pretty_labs("strength", "rt")

In the leading instructions group, the slope varies considerably, but most people are in the right direction. But wait: why are there so few lines in the other groups? It turns out that many participants in those conditions have no variance in strength on skip trials, which means we can’t fit a regression for them. This happens when a participant only skips words for which they responded incorrectly on both pretest trials, yielding the minimum possible strength value.
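The degenerate case is easy to see in a minimal sketch (plain Python, hypothetical numbers): the least-squares slope is cov(x, y) / var(x), so a participant whose skipped words all have the minimum strength gives var(x) = 0 and no defined slope.

```python
def ols_slope(x, y):
    """Least-squares slope, or None when the predictor has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    var_x = sum((xi - mx) ** 2 for xi in x)
    if var_x == 0:
        return None  # all skip trials at minimum strength: no regression line
    cov_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return cov_xy / var_x

# Participant who only skips minimum-strength words (roughly -log(15000)):
assert ols_slope([-9.6, -9.6, -9.6], [1200.0, 900.0, 1500.0]) is None
# Participant with some variance in skipped-word strength:
assert ols_slope([0.0, 1.0, 2.0], [10.0, 12.0, 14.0]) == 2.0
```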
Here’s a table giving the number and proportion of participants who skip at least one trial with above-minimum strength. (Note that it matches the number of lines in the regression plot above.)
trials %>%
group_by(pilot, wid) %>%
filter(skip) %>%
summarise(x=mean(pre_correct==0)) %>%
group_by(pilot) %>%
summarise(n=sum(x != 1), prop=mean(x != 1)) %>% kable(digits=2)

| pilot | n | prop |
|---|---|---|
| high bonus | 5 | 0.56 |
| leading instructions | 25 | 0.89 |
| no time bonus | 8 | 0.53 |
| standard | 9 | 0.56 |
Metacognitive threshold
What seems to be happening is that the skipping threshold is a bit lower (more lenient) in the leading instructions group. We can see this in an aggregate logistic regression:
trials %>%
ggplot(aes(strength, 1*skip, color=pilot)) +
geom_smooth(method = "glm", method.args = list(family = "binomial"), formula=y~x) +
pretty_labs("strength", "probability of skipping")

It’s even clearer if we show individual participants’ curves, using the raw (un-normalized) strength measure. The red line marks the minimum possible strength, corresponding to two incorrect responses.
trials %>%
ggplot(aes(raw_strength, 1*skip, group=wid)) +
geom_smooth(method = "glm", method.args = list(family = "binomial"), formula=y~x,
se=F, color="black", size=.5) +
facet_wrap(~pilot) +
theme(
panel.grid.major.x = element_line(color="gray"),
panel.grid.major.y = element_line(color="gray"),
) + geom_vline(xintercept=-log(15000), color="red2") +
pretty_labs("raw strength", "probability of skipping")

Where does that leave us? It seems that the effect does exist, but that it is obscured when people only skip minimum-strength words. There are a few strategies we could take:
- Use the leading instructions. This is the surest thing, but it might draw suspicion. Personally, I think it’s fine to encourage participants a bit here, as the effect still depends on their having the metacognitive ability. But I can imagine this causing problems in review.
- Use the standard instructions, exclude participants who only skip minimum-strength trials, and plan to collect a very large sample.
- Try to lower people’s skipping threshold in some other way. For example, we could use a stricter error penalty.
- Use a strength measure that is more sensitive at the low end. For example, we could try using 2AFC.
Conclusion: We get the critical result of slower skipping for higher strength targets, but only when we encourage skipping in the instructions. It looks like the critical difference is that the instructions lower the skipping threshold, giving us more range of strength in the skip trials.
Metacognitive accuracy
A secondary question we can ask is to what extent the metacognitive judgement correlates with the objective memory strength measure. Overall, it looks like it does.
trials %>%
filter(skip | correct) %>%
group_by(wid) %>%
filter(sd(judgement) != 0) %>%
mutate(judgement=zscore(judgement), strength=zscore(strength)) %>%
ggplot(aes(judgement, strength, color=judgement_type)) +
geom_point(size=.5) +
geom_smooth(method="lm", se=F) +
facet_wrap(~pilot) +
labs(x="Judgement (z-scored)", y="Strength (z-scored)")

However, when we look at just the FOK trials, we don’t actually see a consistent relationship within individuals (each thin line is a participant; the thick line is a mixed-effects fit).
trials %>% #plot
filter(skip) %>%
regress(judgement, strength, bins=0, bin_range=1) +
# stat_summary(fun.data=mean_cl_boot, size=.2) +
stat_summary(aes(group=wid), fun=mean, size=.2, geom="line") +
xlab("Feeling of Knowing Judgement")

| pilot | estimate | std.error | p.value |
|---|---|---|---|
| high bonus | 0.0027 | 0.0200 | 0.8944 |
| leading instructions | 0.0278 | 0.0286 | 0.3582 |
| no time bonus | -0.0022 | 0.0222 | 0.9230 |
| standard | 0.2295 | 0.1527 | 0.1829 |
Here is the regression coefficient from a model predicting strength from the FOK judgement, fit separately to each participant, excluding those who always give the same FOK rating.
fok = trials %>%
filter(judgement_type == "fok") %>%
group_by(wid) %>%
filter(sd(judgement) != 0)
X = fok %>%
group_modify(function(data, grp) {
lm(strength ~ judgement, data=data) %>% tidy(conf.int=T)
}) %>%
filter(term == "judgement") %>%
left_join(select(excl, wid, pilot)) %>%
arrange(estimate) %>%
ungroup() %>%
mutate(wid=fct_reorder(wid, estimate))
X %>%
ggplot(aes(estimate, wid, xmin=conf.low, xmax=conf.high, color=pilot)) +
geom_pointrange() + geom_vline(xintercept=0) +
theme(legend.position="top")

Very few participants show a positive relationship. It looks like part of the reason for this is that participants are not using the full scale:
trials %>%
filter(skip) %>%
ggplot(aes(judgement, ..prop..)) +
geom_bar() +
facet_wrap(~pilot) +
labs(x="Feeling of Knowing Judgement", y="Proportion of Trials")

Conclusion: We don’t see a consistent relationship between the FOK judgement and memory strength. However, it seems likely that this is because people aren’t able to use the scale in a meaningful way, rather than because they really don’t have any sense of their memory strength.
Miscellaneous
Pretest accuracy
pretest %>%
group_by(pilot, wid) %>%
summarise(accuracy=mean(correct)) %>%
ggplot(aes(pilot, accuracy)) +
geom_quasirandom(color="gray") +
stat_summary(fun.data=mean_cl_boot) +
coord_flip() +
xlab("") +
ylim(0, 1)

Critical trials reaction time
trials %>%
mutate(response_type = case_when(
response_type == "correct" ~ "correct",
response_type == "empty" ~ "empty",
TRUE ~ "error"
)) %>%
ggplot(aes(pilot, rt, color=response_type)) +
# geom_quasirandom(color="gray") +
stat_summary(fun.data=mean_cl_boot) +
coord_flip() +
scale_colour_manual(values=c(
"springgreen4",
"gray",
"deeppink3"
), aesthetics=c("color"), name="") +
xlab("")

Instructions screenshots
Standard instructions
Leading instructions